Abstract

We revisit the problem of deciding whether a given string is uniquely
decodable from its bigram counts by means of a finite automaton. An efficient
algorithm for constructing a polynomial-size nondeterministic finite
automaton that decides unique decodability is given. Conversely, we show that
the minimum deterministic finite automaton for deciding unique decodability
has at least exponentially many states in the alphabet size.
1. Introduction

Reconstructing a string from its snippets is a problem of fundamental
importance in many areas of computing. In a biological context this problem
amounts to sequencing of DNA from short reads [?] and reconstruction of
protein sequences from K-peptides [?]. Communications protocols [?, ?]
recombine snippets from related documents to identify differences between
them, and fuzzy extractors [10] use similar techniques for producing keys
from noise-prone biometric data. Computational linguistics also makes
occasional use of this snippet representation (under the name
Wickelfeatures [1]), as a means to learn transformations on varying-length
sequences.

In general, there may be a large number of possible string reconstructions
from a given collection of overlapping snippets; for example, the snippets
{at, an, ka, na, ta} can be combined into katana or kanata. In order to keep
the decoding complexity and ambiguity low, it is desirable in practice to
choose a snippet length that allows only a few distinct reconstructions, the
ideal number being exactly one.

Main results. We consider the problem of efficiently determining whether
a collection of snippets has a unique reconstruction. More precisely, we
construct a nondeterministic finite automaton (NFA) on \(O(|\Sigma|^3)\)
states that recognizes precisely those strings over the alphabet \(\Sigma\)
that do not have a unique reconstruction, so that rejection by the NFA
certifies unique decodability. Our NFA has a particularly simple form that
provides for an easy and efficient implementation, and runs on a string of
length \(\ell\) in time \(O(\ell|\Sigma|^3)\) and constant memory. We further
show that the minimum equivalent deterministic finite automaton has at least
\(2^{|\Sigma|-1}\) states. This lower bound is still far off from the upper
bound \(2^{\Theta(|\Sigma| \log |\Sigma|)}\) implicit in [11], and closing
this gap is an intriguing open problem.

Related work. It was shown in [3] that the collection of strings having a
unique reconstruction from the snippet representation is a regular language.
An explicit construction of a deterministic finite-state automaton (DFA)
recognizing this language was given by Li and Xie [11]. Unfortunately, this
DFA has

\[ 2^{|\Sigma|}(|\Sigma|+1)(|\Sigma|+1)^{|\Sigma|+1} \in 2^{\Theta(|\Sigma| \log |\Sigma|)} \]

states, and thus is not practical except for very small alphabets. As we show
in this paper, there is no DFA of subexponential size for recognizing this
language; however, we exhibit an NFA with \(O(|\Sigma|^3)\) states that
recognizes its complement.

Outline. We proceed in Section 2 with some preliminary definitions and
notation. In Section 3 we present our construction of an NFA for deciding
unique decodability, and we prove its correctness in Section 4. Finally, we
present a new lower bound on the size of a DFA accepting uniquely decodable
strings in Section 5 and conclude in Section 6 with discussion and an open
problem.

2. Preliminaries 

We assume a finite alphabet \(\Sigma\) along with a special delimiter
character \(\$ \notin \Sigma\), and define \(\Sigma_\$ = \Sigma \cup \{\$\}\).
For \(k \ge 1\), the k-gram map \(\Phi\) takes a string \(x \in \Sigma_\$^*\)
to a vector \(\xi \in \mathbb{N}^{\Sigma_\$^k}\), where
\(\xi_{i_1 \ldots i_k} \in \mathbb{N}\) is the number of times the string
\(i_1 \ldots i_k \in \Sigma_\$^k\) occurred in \(x\) as a contiguous
subsequence, counting overlaps.^1 As we have seen, the bigram map
\(\Phi : \Sigma_\$^* \to \mathbb{N}^{\Sigma_\$^2}\) is not injective; for
example, \(\Phi(\$katana\$) = \Phi(\$kanata\$)\).
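
As a small illustration of the bigram map, the following sketch counts the bigrams of a delimited string and confirms the katana/kanata collision. The function name `bigram_map` and the use of a Python `Counter` as the count vector are assumptions made for this example only.

```python
from collections import Counter

DELIM = "$"  # delimiter symbol, assumed not to occur in the input string


def bigram_map(w):
    """Count every bigram of $w$, overlaps included."""
    padded = DELIM + w + DELIM
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


# The two strings have identical bigram counts, so the map is not injective.
print(bigram_map("katana") == bigram_map("kanata"))  # True
print(bigram_map("katan") == bigram_map("katana"))   # False
```
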

We denote by \(L_{\mathrm{UNIQ}} \subseteq \Sigma^*\) the collection of all
strings \(w\) for which

\[ \Phi^{-1}(\Phi(\$w\$)) = \{\$w\$\} \]

and refer to these strings as uniquely decodable, meaning that there is
exactly one way to reconstruct them from their bigram snippets. The examples
$katan$ and $katana$ show that
\(\emptyset \neq L_{\mathrm{UNIQ}} \neq \Sigma^*\) for \(|\Sigma| > 1\). The
induced bigram graph of a string \(w \in \Sigma^*\) is a weighted directed
graph \(G = (V, E)\), with \(V = \Sigma_\$\) and
\(E = \{e(a,b) : a, b \in \Sigma_\$\}\), where the edge weight
\(e(a,b) \ge 0\) records the number of times \(a\) occurs immediately before
\(b\) in the string $w$.
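
The definition of unique decodability can be checked by brute force on short strings: every decoding corresponds to a walk from $ back to $ that uses each edge of the bigram graph exactly as often as its weight. The sketch below enumerates such walks by backtracking; the names `decodings` and `is_uniquely_decodable` are hypothetical, and the search takes exponential time in the worst case, so it is meant only as a reference check and not as the method of this paper.

```python
from collections import Counter


def decodings(w, limit=2):
    """Return up to `limit` strings with the same delimited bigram counts as w."""
    padded = "$" + w + "$"                     # assumes "$" does not occur in w
    edges = Counter(zip(padded, padded[1:]))   # e(a, b): weight of edge a -> b
    total = len(padded) - 1                    # number of edges to traverse
    found = []

    def walk(node, path):
        if len(found) >= limit:
            return
        if len(path) == total + 1:             # every edge has been used
            if node == "$":
                found.append("".join(path[1:-1]))
            return
        for (a, b), count in list(edges.items()):
            if a == node and count > 0:
                edges[(a, b)] -= 1             # consume one copy of the edge
                path.append(b)
                walk(b, path)
                path.pop()
                edges[(a, b)] += 1             # backtrack

    walk("$", ["$"])
    return found


def is_uniquely_decodable(w):
    return len(decodings(w, limit=2)) == 1


print(sorted(decodings("katana", limit=10)))   # ['kanata', 'katana']
print(is_uniquely_decodable("katan"))          # True
```
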

We also follow the standard conventions for sets, languages, regular
expressions, and automata [?, ?, ?]. As such, a factor of a string
(colloquially a snippet) is any of its contiguous substrings. The term
\(\Sigma^*\) denotes the free monoid over the alphabet \(\Sigma\), and, for
\(S \subseteq \Sigma\), the term \(S^*\) has the usual regular-expression
interpretation; the language defined by a regular expression \(R\) will be
denoted \(L(R)\). In addition, we will denote the omission of a symbol from
the alphabet by \(\Sigma_x := \Sigma \setminus \{x\}\) for \(x \in \Sigma\).

Finally, we shall use the standard five-tuple [3] notation
\((\Sigma, Q, q_0, \delta, F)\) to specify a given DFA, where \(\Sigma\) is
the input alphabet, \(Q\) is the set of states, \(q_0\) is the initial state,
\(\delta\) is the transition function, and \(F\) is the set of final states;
an analogous notation is used for NFAs. We use the notation \(|\cdot|\) to
denote both the size of an automaton (measured by the number of states) and
the length of a string.

3. Construction and simulation of the NFA 

3.1. Obstruction languages and their DFAs 



Our starting point is the observation, also made in [?], that
\(L_{\mathrm{UNIQ}}\) is a factorial language, meaning that it is closed
under taking factors. From here, Li and Xie [11] proceed to characterize
\(L_{\mathrm{UNIQ}}\) in terms of its minimal forbidden words. Rather than
looking at forbidden words, we will consider obstructions in the form of
simple regular languages.

^1 In this paper we will focus on the bigram case when \(k = 2\), although
the general case \(k > 2\) readily follows [?, 11].

Figure 1: The canonical DFA for \(K_{x,a,b}\), for \(a \neq b\) (left) and
\(a = b\) (right); note that this DFA never has more than 9 states,
regardless of alphabet size.

For \(x \in \Sigma\) and \(a, b \in \Sigma_x\), define

\[ I_{x,a,b} = L(\Sigma^* a x \Sigma_a^* b \Sigma^*). \]

Thus, \(I_{x,a,b}\) is the collection of all strings \(w \in \Sigma^*\) whose
induced bigram graph has an edge from \(a\) to \(x\) and a directed path from
\(x\) to \(b\) avoiding \(a\). Similarly, for \(x \in \Sigma\) and
\(a, b \in \Sigma_x\), define

\[ J_{x,a,b} = L(\Sigma^* a \Sigma_x^* b \Sigma^*). \]

Thus, \(J_{x,a,b}\) is the collection of all strings \(w \in \Sigma^*\) whose
induced bigram graph has a directed path from \(a\) to \(b\) avoiding \(x\).
Finally, define an obstruction language

\[ K_{x,a,b} = I_{x,a,b} \cap J_{x,a,b}, \]

whose elements will be called obstructions. The language of all obstructions
will be denoted

\[ L_{\mathrm{OBST}} = \bigcup_{x \in \Sigma} \bigcup_{a,b \in \Sigma_x} K_{x,a,b}. \tag{1} \]

The DFA recognizing a typical \(K_{x,a,b}\) is illustrated in Figure 1. One
can verify that these DFAs indeed recognize \(K_{x,a,b}\) straightforwardly
for \(\Sigma = \{a,b,x\}\), and note that the automata continue to be correct
for any \(\Sigma' \supseteq \{a,b,x\}\). An important feature of
\(K_{x,a,b}\) is that 9 states always suffice for its DFA, regardless of
\(\Sigma\) (one can also check that the DFAs given in Figure 1 are canonical
by applying the DFA minimization algorithm [?]).
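
Because \(I_{x,a,b}\) and \(J_{x,a,b}\) are given by regular expressions, membership in a single obstruction language can be tested directly with a regex engine. The sketch below uses Python's `re` module in place of the DFAs of Figure 1; the helper names `in_I`, `in_J`, and `in_K` are assumptions for this example, and a negated character class stands in for \(\Sigma_a\) and \(\Sigma_x\).

```python
import re


def in_I(w, x, a, b):
    """w in I_{x,a,b} = L(Sigma* a x Sigma_a* b Sigma*)."""
    a_, x_, b_ = (re.escape(c) for c in (a, x, b))
    return re.search(f"{a_}{x_}[^{a_}]*{b_}", w) is not None


def in_J(w, x, a, b):
    """w in J_{x,a,b} = L(Sigma* a Sigma_x* b Sigma*)."""
    a_, x_, b_ = (re.escape(c) for c in (a, x, b))
    return re.search(f"{a_}[^{x_}]*{b_}", w) is not None


def in_K(w, x, a, b):
    """w in K_{x,a,b}; requires a and b to differ from x."""
    if x in (a, b):
        raise ValueError("a and b must be chosen from Sigma minus {x}")
    return in_I(w, x, a, b) and in_J(w, x, a, b)


# katana contains both "ata" (a, then x = t, then back to a) and "ana".
print(in_K("katana", "t", "a", "a"))  # True
print(in_K("katan", "t", "a", "a"))   # False
```
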
3.2. The NFA as a union of obstructions

For \(x \in \Sigma\) and \(a, b \in \Sigma_x\), let
\(M_{x,a,b} = (\Sigma, Q_{x,a,b}, s_{x,a,b}, F_{x,a,b}, \delta_{x,a,b})\) be
the canonical DFA recognizing the obstruction language \(K_{x,a,b}\). Observe
that there are

\[ |\Sigma|\bigl(|\Sigma|-1 + (|\Sigma|-1)(|\Sigma|-2)\bigr) \in O(|\Sigma|^3) \tag{2} \]

distinct obstruction languages. Indeed, there are \(|\Sigma|\) choices for
\(x\). If \(a = b\), we have \(|\Sigma|-1\) ways to choose
\(a \in \Sigma_x\), and if \(a \neq b\), we have \((|\Sigma|-1)(|\Sigma|-2)\)
ways to choose \((a,b) \in \Sigma_x^2\).
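
The count in (2) can be confirmed by enumerating the index triples directly. In this sketch the function names are hypothetical and the alphabet is passed as a string of distinct characters.

```python
from itertools import product


def obstruction_triples(sigma):
    """All (x, a, b) with x in sigma and a, b in sigma minus {x}."""
    return [(x, a, b) for x in sigma
            for a, b in product(sigma, repeat=2)
            if a != x and b != x]


def count_by_formula(n):
    """The count in (2): n * ((n - 1) + (n - 1) * (n - 2))."""
    return n * ((n - 1) + (n - 1) * (n - 2))


for n in range(1, 6):
    sigma = "abcde"[:n]
    print(n, len(obstruction_triples(sigma)), count_by_formula(n))
```

For alphabet sizes 1 through 5 both columns read 0, 2, 12, 36, 80, which is \(|\Sigma|(|\Sigma|-1)^2\) in each case.
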

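Define the NFA \(M_{\mathrm{OBST}} = (\Sigma, Q, Q_0, \Delta, F)\) as follows:

\[ Q = \bigcup_{x \in \Sigma} \bigcup_{a,b \in \Sigma_x} Q_{x,a,b}, \qquad
   Q_0 = \bigcup_{x \in \Sigma} \bigcup_{a,b \in \Sigma_x} \{s_{x,a,b}\}, \]
\[ \Delta = \bigcup_{x \in \Sigma} \bigcup_{a,b \in \Sigma_x} \delta_{x,a,b}, \qquad
   F = \bigcup_{x \in \Sigma} \bigcup_{a,b \in \Sigma_x} F_{x,a,b}. \]

In words, \(M_{\mathrm{OBST}}\) is the union NFA comprised of all the DFAs
\(M_{x,a,b}\); note that its only source of nondeterminism is that it
simultaneously starts in each of the start states \(s_{x,a,b}\). By design,
\(M_{\mathrm{OBST}}\) is an NFA recognizing the language
\(L_{\mathrm{OBST}}\).

We collect these observations into a theorem.

Theorem 1. The NFA \(M_{\mathrm{OBST}}\)
(i) recognizes the language \(L_{\mathrm{OBST}}\),
(ii) has

\[ |\Sigma|\bigl(7(|\Sigma|-1) + 9(|\Sigma|-1)(|\Sigma|-2)\bigr) \in O(|\Sigma|^3) \]

states, and

(iii) can be simulated on a string \(w\) of length \(\ell\) in
\(O(\ell|\Sigma|^3)\) time and \(\Theta(1)\) space.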

Proof. Item (i) follows from the discussion above. The claim in (ii) follows
from the calculation in (2) and the construction in Figure 1, which implies
\(|M_{x,a,a}| = 7\) and \(|M_{x,a,b}| = 9\) for \(a \neq b\). To simulate
\(M_{\mathrm{OBST}}\) on a string \(w\) with the complexity in (iii), our
simulator runs each of the DFAs \(M_{x,a,b}\) on \(w\). If any of them
accepts, the simulator accepts; if none accepts, it rejects. The DFAs
\(M_{x,a,b}\) can be constructed in constant time and space, sequentially, by
substituting the appropriate values of \(x, a, b\) in the transitions of the
generic DFAs illustrated in Figure 1. □
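
The simulation strategy in the proof can be sketched as follows. Since the transition tables of Figure 1 are not reproduced here, the sketch substitutes two hand-written scanners, one for \(I_{x,a,b}\) and one for \(J_{x,a,b}\), run in lockstep; this is an assumption of the example and not the canonical DFA, although it accepts the same language \(K_{x,a,b}\). Each triple is processed in one left-to-right pass with two small integer state variables, and the triples are handled one after another.

```python
from itertools import product


def scan_triple(w, x, a, b):
    """One pass over w deciding w in K_{x,a,b} (requires a != x and b != x)."""
    i_state = 0  # 0: idle, 1: just read a, 2: read "a x" then non-a letters, 3: I matched
    j_state = 0  # 0: idle, 1: read a then only non-x letters, 2: J matched
    for c in w:
        if i_state == 0:
            i_state = 1 if c == a else 0
        elif i_state == 1:
            i_state = 2 if c == x else (1 if c == a else 0)
        elif i_state == 2:
            i_state = 3 if c == b else (1 if c == a else 2)
        if j_state == 0:
            j_state = 1 if c == a else 0
        elif j_state == 1:
            j_state = 2 if c == b else (0 if c == x else 1)
        if i_state == 3 and j_state == 2:
            return True          # both factors found, w is an obstruction
    return False


def simulate_m_obst(w, sigma):
    """Accept iff some scanner accepts, i.e. iff w is in L_OBST."""
    for x, a, b in product(sigma, repeat=3):
        if a != x and b != x and scan_triple(w, x, a, b):
            return True
    return False


print(simulate_m_obst("katana", "aknt"))  # True
print(simulate_m_obst("katan", "aknt"))   # False
```

The outer loop visits \(O(|\Sigma|^3)\) triples and each scan reads \(w\) once, which matches the time bound in (iii).
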

4. Proof of correctness 

So far, we have defined two seemingly unrelated objects:
\(L_{\mathrm{UNIQ}}\), the collection of uniquely decodable strings, and
\(L_{\mathrm{OBST}}\), the language of obstructions. We shall now prove that
the two are complementary.

Theorem 2.

\[ L_{\mathrm{UNIQ}} = \Sigma^* \setminus L_{\mathrm{OBST}}. \]

We develop the proof with the aid of several lemmata.

4.1. \(L_{\mathrm{OBST}} \subseteq \Sigma^* \setminus L_{\mathrm{UNIQ}}\)

The forward direction has the simpler proof, deriving from one lemma.

Lemma 3. For \(x \in \Sigma\) and \(a, b \in \Sigma_x\), we have

\[ K_{x,a,b} \subseteq \Sigma^* \setminus L_{\mathrm{UNIQ}}. \]

Proof. Let \(w \in K_{x,a,b}\). By definition, \(w\) contains a factor of the
form \(u = a x u' b\), with \(u' \in \Sigma_a^*\), and a factor of the form
\(v = a v' b\), with \(v' \in \Sigma_x^*\). Note that \(u\) and \(v\) cannot
overlap, and so \(w\) must be of the form \(w' = \alpha u \beta v \gamma\) or
\(w'' = \alpha v \beta u \gamma\) for some
\(\alpha, \beta, \gamma \in \Sigma^*\). Since \(u\) and \(v\) both start with
\(a\) and end with \(b\), the bigram encodings of \(w'\) and \(w''\) will be
identical, meaning that their preimage string \(w\) is not uniquely
decodable. □

4.2. \(L_{\mathrm{OBST}} \supseteq \Sigma^* \setminus L_{\mathrm{UNIQ}}\)

The proof of the reverse direction draws heavily from the definitions in
[3], some of which were reproduced in Section 2. For sake of exposition, we
note that the weighted inflow and outflow of a node \(v\) in the bigram graph
of a string are given by^2

\[ \mathrm{inflow}(v) = \sum_{u \neq v} e(u,v), \qquad
   \mathrm{outflow}(v) = \sum_{u \neq v} e(v,u). \]

^2 These are distinct from the weighted in-degree and out-degree in graph
theory, in that they do not include the weights of self-loops.

The self-flow of \(v\) is simply \(\text{self-flow}(v) = e(v,v)\). Finally,
for an edge \(e(v,w) > 0\), we say that \(v\) is a parent of \(w\) or \(w\)
is a child of \(v\) and denote both with \(v \to w\).
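
These graph quantities are straightforward to compute from the bigram counts. The following sketch assumes the bigram graph is stored as a `Counter` keyed by letter pairs; all function names are hypothetical.

```python
from collections import Counter


def bigram_graph(w):
    """Edge weights e(a, b) of the bigram graph of $w$."""
    padded = "$" + w + "$"   # assumes "$" does not occur in w
    return Counter(zip(padded, padded[1:]))


def inflow(e, v):
    """Total weight entering v, self-loops excluded."""
    return sum(c for (a, b), c in e.items() if b == v and a != v)


def outflow(e, v):
    """Total weight leaving v, self-loops excluded."""
    return sum(c for (a, b), c in e.items() if a == v and b != v)


def self_flow(e, v):
    return e[(v, v)]


def children(e, v):
    """Nodes w with e(v, w) > 0; v counts as its own child if it has a self-loop."""
    return sorted(b for (a, b), c in e.items() if a == v and c > 0)


g = bigram_graph("katana")
print(inflow(g, "a"), outflow(g, "a"), self_flow(g, "a"))  # 3 3 0
print(children(g, "a"))                                     # ['$', 'n', 't']
```
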

In addition, the pruning operator \(P_x(w)\) deletes all occurrences of the
letter \(x \in \Sigma\) from the string \(w \in \Sigma^*\). A vertex
\(x \neq \$\) is removable in a bigram graph \(G\) [?, Definition 4] if:

(a) \(x\) has a single child \(b\),

(b) no parent of \(x\) has a child \(b\), and

(c) if \(x\) is a child of \(x\), then \(\mathrm{outflow}(x) = 1\).

The removal of a removable node results in a string with the same number
of decodings as \(w\) [7]. Where these \(x\) correspond to a node with
outflow 1 in the bigram graph of \(w\), we call them type-I removable;
otherwise, we call them type-II removable.

Our first observation is that pruning a removable node preserves
obstructions:

Lemma 4. Suppose that \(w \in \Sigma^*\) induces the bigram graph \(G(w)\)
with a removable node \(r\), and let \(w' = P_r(w)\). Then
\(w \in L_{\mathrm{OBST}}\) if and only if \(w' \in L_{\mathrm{OBST}}\).

Proof. For the forward direction, assume \(w \in L_{\mathrm{OBST}}\), meaning
that \(w\) belongs to some \(K_{x,a,b}\). Note that if
\(r \notin \{x,a,b\}\) then \(w' \in K_{x,a,b}\), because deleting \(r\) does
not change membership in either \(I_{x,a,b}\) or \(J_{x,a,b}\). Thus, we need
only consider what happens when one of \(r \in \{a,b,x\}\) is pruned.

We can rule out the case \(r = a\) because \(a\) has two distinct children
and so, by definition, is not removable. For the case \(r = b\), we note that
\(b\) appears at least twice in the string and thus has outflow \(\ge 2\).
For \(b\) to be removable, it must have a single child \(b'\), making \(w'\)
an element of \(K_{x,a,b'}\).

It remains to consider the case \(r = x\). Recall that \(w \in K_{x,a,b}\)
and thus contains a factor \(u = a x u' b\), with \(u' \in \Sigma_a^*\).
Consider the sub-case where \(w\) contains \(ab\) as a factor. Now if
\(u' = \varepsilon\) then \(x\) is not removable in \(G\) (its parent \(a\)
points to its child \(b\)), so assume that \(u' = x'u''\) for
\(x' \in \Sigma \setminus \{x,a,b\}\) and \(u'' \in \Sigma_a^*\). In this
case, \(x\) might be removable in \(G\), but then \(w' \in K_{x',a,b}\).
Alternatively, suppose \(w \in K_{x,a,b}\) does not contain \(ab\) as a
factor. It must, however, contain the factor \(v = a v' b\) with
\(v' \in \Sigma_x^*\). If \(u' = \varepsilon\) then \(w'\) has the factor
\(ab\) and also the factor \(a v' b\), and thus belongs to \(K_{y,a,b}\) for
some \(y\) in \(v'\). Otherwise, \(w'\) has the factors
\(a u' b = a u'_1 u'_2 \ldots u'_k b\) and
\(a v' b = a v'_1 v'_2 \ldots v'_m b\). We cannot have \(u'_1 = v'_1\), for
then \(w\) would have the factors \(a x u'_1\) and \(a u'_1\), and \(x\)
would not be removable in \(G\). If \(u'_1\) does not occur in \(v'\), then
\(w' \in K_{u'_1,a,b}\). If \(u'_1\) occurs in \(v'\), then
\(w' \in K_{v'_1,a,u'_1}\).

The direction \(w' \in L_{\mathrm{OBST}} \Rightarrow w \in L_{\mathrm{OBST}}\)
is proved analogously. □

Before stating the next lemma, we introduce another bit of notation.
For two nodes \(a, b\) (not necessarily distinct) in a given bigram graph,
the existence of a directed path from \(a\) to \(b\) will be denoted by
\(a \leadsto b\). If in addition there is a directed path from \(a\) to \(b\)
avoiding \(x\), we indicate this by \(a \overset{\neg x}{\leadsto} b\). These
relations may be concatenated with the obvious semantics. Thus,
\(a \to b \overset{\neg x}{\leadsto} c \leadsto d\) implies the existence of
a directed path in \(G\) that takes the edge \(a \to b\), then reaches \(c\)
having avoided \(x\) between \(b\) and \(c\), and then reaches \(d\).

Lemma 5. Suppose the bigram graph \(G\) has a node \(g\) with distinct
children \(x, y\) such that \(x \leadsto g\) and \(y \leadsto g\). Then every
traversal of \(G\) belongs to
\(K_{x,g,g} \cup K_{y,g,g} \cup K_{x,y,y} \cup K_{y,x,x}\).

Proof. Our assumptions on \(G\) imply \(g \to x \leadsto g\) and
\(g \to y \leadsto g\). We claim that at least one of
\(x \overset{\neg y}{\leadsto} g\), \(y \overset{\neg x}{\leadsto} g\) must
hold. Indeed, suppose that every directed path from \(x\) to \(g\) passes
through \(y\); then there is a directed path from \(y\) to \(g\) avoiding
\(x\). Consider the case that \(x \overset{\neg y}{\leadsto} g\). In this
case, \(G\) also satisfies at least one of
(i) \(g \to x \overset{\neg y}{\leadsto} g\),
(ii) \(g \to x \leadsto y \leadsto x \leadsto g\).
Case (i) corresponds to traversals belonging to \(K_{y,g,g}\) and (ii)
corresponds to traversals belonging to \(K_{y,x,x}\). A similar analysis of
the case \(y \overset{\neg x}{\leadsto} g\) proves the claim. □

Finally, we show that any non-uniquely decodable string must be an
obstruction:

Lemma 6.

\[ \Sigma^* \setminus L_{\mathrm{UNIQ}} \subseteq \bigcup_{x \in \Sigma} \bigcup_{a,b \in \Sigma_x} K_{x,a,b}. \]

Proof. Pick a \(w \in \Sigma^* \setminus L_{\mathrm{UNIQ}}\). Since \(w\) is
not uniquely decodable, its bigram graph \(G\) has more than one valid
traversal. Let \(G'\) be the graph obtained after pruning the removable nodes
from \(G\) (in some order) until no removable nodes are remaining. Then
\(G'\) is a non-trivial graph [7, Theorem 9] and has the same number of
decodings (valid traversals) as \(G\) [?, Theorems 5, 6]. Furthermore,
Lemma 4 above implies that a decoding \(u\) of \(G\) is an obstruction iff
the corresponding pruned decoding \(u'\) of \(G'\) is an obstruction.

Thus, to prove the lemma, it suffices to show that every decoding of
\(G'\) is an obstruction. By construction, \(G'\) has no removable nodes,
meaning that at least one of the following holds for every node
\(g \in G'\), \(g \neq \$\):

(i) \(g \to a\) and \(g \to b\) for distinct \(a, b \in \Sigma_g\).

(ii) \(\text{self-flow}(g) > 0\) and \(\mathrm{outflow}(g) > 1\).

(iii) \(a \to g \to b\) and \(a \to b\) for some nodes \(a, b\).

If (iii) holds for any node \(g\), then every decoding of \(G'\) is an
obstruction of the type \(K_{g,a,b}\).

There are two ways that (ii) can hold for any \(g\): (ii') \(g \to g\) and
\(e(g,x) > 1\), or (ii'') \(g \to g\) and \(g \to x\), \(g \to y\) for
\(x \neq y\). In case of (ii'), any decoding of \(G'\) must contain both a
factor \(gg\) and also a factor \(gx\) and a directed path from \(x\) back to
\(g\). Thus, any such decoding belongs to \(K_{x,g,g}\). Similarly, in case
of (ii''), we have \(x \leadsto g\) or \(y \leadsto g\), resulting in the
decoding belonging to \(K_{x,g,g}\) or \(K_{y,g,g}\), respectively.

It remains to examine the case where every node \(g\) satisfies (i). Suppose
for now that in addition to \(g \to x\) and \(g \to y\) for
\(x \neq y \in \Sigma_g\) we also have \(g \to z\) for some
\(z \in \Sigma \setminus \{g,x,y\}\). In any decoding of \(G'\), at least two
of \(\{x,y,z\}\) must have a directed path back to \(g\). Lemma 5 then
implies that every decoding of \(G'\) belongs to

\[ \bigcup_{t \neq t' \in \{x,y,z\}} K_{t,g,g} \cup K_{t',g,g} \cup K_{t,t',t'} \cup K_{t',t,t}. \]

Having dispensed with the three-child case and with (ii) and (iii) above,
the only remaining scenario is that every \(g \neq \$\) in \(G'\) has exactly
2 children and \(\text{self-flow}(g) = 0\). We claim that in this case, there
must be a \(g \in G'\) with children \(x \neq y\) such that \(x \leadsto g\)
and \(y \leadsto g\). If this were not the case, \(G'\) would be uniquely
decodable, since at each node \(g\) we would be obligated to first take the
unique child that does have a directed path back to \(g\). But this
contradicts Lemma 8 in [?], which states that a bigram graph where every node
other than $ has exactly 2 children and no self-flow has multiple decodings.
Let \(g\) be the requisite node with children \(x \neq y\); by Lemma 5 we
have that every decoding of \(G'\) belongs to
\(K_{x,g,g} \cup K_{y,g,g} \cup K_{x,y,y} \cup K_{y,x,x}\). □

Theorem 2 follows immediately from Lemmas 3 and 6. In light of this, the
runtime complexity in Theorem 1(iii) can be improved from
\(O(\ell|\Sigma|^3)\) to \(O(\ell'|\Sigma|^3)\), where \(\ell'\) is the
length of the shortest prefix \(u \notin L_{\mathrm{UNIQ}}\) of \(w\) (and
\(\ell' = \ell\) if no such prefix exists).
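
Theorem 2 lends itself to an empirical sanity check on short strings, which is of course not a substitute for the proof. The sketch below compares a regex-based obstruction test against a brute-force count of decodings and also reports the shortest prefix that is already an obstruction. All function names are hypothetical, and letters are assumed to be alphanumeric so that they can be placed in a regular expression unescaped.

```python
import re
from collections import Counter
from itertools import product


def is_obstruction(w, sigma):
    """True iff w lies in some K_{x,a,b} with x in sigma and a, b != x."""
    for x, a, b in product(sigma, repeat=3):
        if x in (a, b):
            continue
        in_i = re.search(f"{a}{x}[^{a}]*{b}", w)   # factor witnessing I_{x,a,b}
        in_j = re.search(f"{a}[^{x}]*{b}", w)      # factor witnessing J_{x,a,b}
        if in_i and in_j:
            return True
    return False


def count_decodings(w, cap=2):
    """Number of strings sharing the bigram counts of $w$, capped at `cap`."""
    padded = "$" + w + "$"
    edges = Counter(zip(padded, padded[1:]))

    def walk(node, remaining):
        if remaining == 0:
            return 1 if node == "$" else 0
        total = 0
        for (a, b), c in list(edges.items()):
            if a == node and c > 0 and total < cap:
                edges[(a, b)] -= 1
                total += walk(b, remaining - 1)
                edges[(a, b)] += 1
        return min(total, cap)

    return walk("$", len(padded) - 1)


def mismatches(sigma, max_len):
    """Strings of length <= max_len on which the two tests disagree with Theorem 2."""
    bad = []
    for n in range(max_len + 1):
        for letters in product(sigma, repeat=n):
            w = "".join(letters)
            if is_obstruction(w, sigma) == (count_decodings(w) == 1):
                bad.append(w)
    return bad


def shortest_bad_prefix(w, sigma):
    """Length of the shortest prefix of w in L_OBST, or None if there is none."""
    for n in range(1, len(w) + 1):
        if is_obstruction(w[:n], sigma):
            return n
    return None


print(mismatches("ab", 7))                          # [] for the binary alphabet
print(shortest_bad_prefix("katanakatana", "aknt"))  # 6
```

Because \(L_{\mathrm{UNIQ}}\) is closed under taking factors, once a prefix is an obstruction every longer prefix is one too, which is why a simulator may stop at that point.
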
5. Lower bound for DFAs recognizing \(L_{\mathrm{UNIQ}}\)

We know from Theorems 1 and 2 that \(L_{\mathrm{UNIQ}} \subseteq \Sigma^*\)
is a regular language. Let us denote the minimum DFA recognizing
\(L_{\mathrm{UNIQ}}\) by \(M^{\circ}_{\mathrm{UNIQ}}\). In this section we
examine the size of \(M^{\circ}_{\mathrm{UNIQ}}\), as measured by the number
of states. In [11], Li and Xie constructed a DFA on

\[ 2^{|\Sigma|}(|\Sigma|+1)(|\Sigma|+1)^{|\Sigma|+1} \in 2^{\Theta(|\Sigma| \log |\Sigma|)} \tag{3} \]

states recognizing \(L_{\mathrm{UNIQ}}\). However, their construction is not
optimal: for example, when \(|\Sigma| = 3\), the left-hand side of (3) is
equal to 8192 while the canonical DFA for
\(L_{\mathrm{UNIQ}} \subseteq \{a,b,c\}^*\) has 84 states.^3 The main result
of this section is the following lower bound, which is also not tight as it
gives a value of 4 states for this alphabet size.



Theorem 7. For \(|\Sigma| \ge 1\),

\[ |M^{\circ}_{\mathrm{UNIQ}}| \ge 2^{|\Sigma|-1}. \]



Proof. Define \(\equiv_U\) to be the usual equivalence relation induced on
\(\Sigma^*\) by \(L_{\mathrm{UNIQ}}\): \(x \equiv_U y\) if and only if there
is no \(t \in \Sigma^*\) that distinguishes \(x\) from \(y\), meaning that
\(xt \in L_{\mathrm{UNIQ}}\) and \(yt \notin L_{\mathrm{UNIQ}}\) or vice
versa. Then the Myhill-Nerode theorem [3] assures us that the number of
states in a DFA accepting \(L_{\mathrm{UNIQ}}\) is at least the number of
strings that are pairwise-distinguishable with respect to
\(L_{\mathrm{UNIQ}}\).

Our proof proceeds by induction on the alphabet size, where we construct a
set \(D_i\) of \(2^i\) pairwise-distinguishable strings over the alphabet
\(\Sigma_i = \{(j) : 0 \le j \le i\}\), \(i = 0, 1, 2, \ldots\). For the base
case \(i = 0\), we take \(D_0 = \{0\}\).

Now suppose, as an inductive hypothesis, that we have constructed the set
\(D_i\) of \(2^i\) distinguished strings over the alphabet \(\Sigma_i\), for
\(i \ge 0\). We then define \(D_{i+1}\) over the alphabet \(\Sigma_{i+1}\) as
the union \(D_{i+1} = D_i \cup D'_i\), where \(D'_i\) simply appends the
letter \((i+1) \in \Sigma_{i+1}\) to each string in \(D_i\); more precisely,
\(D'_i = \{w(i+1) : w \in D_i\}\). Thus, for example,
\(D_2 = \{0, 01, 02, 012\}\) and \(D'_2 = \{03, 013, 023, 0123\}\) combine to
form \(D_3\). Note that the letters always appear in \(w \in D_i\) in
strictly increasing order, and thus \(D_i \subseteq L_{\mathrm{UNIQ}}\) for
all \(i \ge 0\).
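
The recursive definition of \(D_i\) can be sketched in a few lines. The function name `build_D` is hypothetical, and letters are represented as single decimal digits, an assumption that limits this illustration to \(i \le 9\).

```python
def build_D(i):
    """D_0 = {"0"}; D_{j+1} = D_j together with every string of D_j extended by j+1."""
    if not 0 <= i <= 9:
        raise ValueError("this sketch encodes letters as single digits, so 0 <= i <= 9")
    D = ["0"]
    for letter in range(1, i + 1):
        D = D + [w + str(letter) for w in D]
    return D


print(build_D(2))       # ['0', '01', '02', '012']
print(len(build_D(3)))  # 8
```
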
^3 This may be verified by determinizing, negating, and then minimizing the
NFA \(M_{\mathrm{OBST}}\) constructed in Section 3 or by minimizing the DFA
of Li and Xie [11].
What remains to prove is that the members of \(D_{i+1}\) as constructed above
are all pairwise distinguishable under \(\equiv_U\). In proving that
\(u \not\equiv_U v\) for all distinct \(u, v \in D_{i+1}\), we consider three
cases: (i) both strings belong to \(D_i\), (ii) both strings belong to
\(D'_i\), and (iii) one string belongs to \(D_i\) and the other to \(D'_i\).
For \(u, v \in D_i\), our inductive hypothesis applies to give
\(u \not\equiv_U v\). Consider \(u, v \in D'_i\). Since the sequences \(u\)
and \(v\) are strictly increasing and distinct, there is necessarily a letter
\(x\) that appears in one and not the other. Then \(u\) and \(v\) are
distinguished by \(xx\). To see this, suppose, without loss of generality,
that \(x\) appears in \(u\) but not in \(v\), and note that the last letter
of \(u\) and \(v\) is \((i+1) \neq x\); then \(vxx \in L_{\mathrm{UNIQ}}\)
and \(uxx \notin L_{\mathrm{UNIQ}}\).

Finally, consider the case of \(u = u_1 u_2 \cdots u_k \in D_i\) and
\(v = v_1 v_2 \cdots v_\ell \in D'_i\). We examine two sub-cases. First,
suppose the strings \(u_1 u_2 \cdots u_{k-1}\) and
\(v_1 v_2 \cdots v_{\ell-1}\) are distinct. Let \(x\) be a letter that
appears in one and not the other. Then \(u\) and \(v\) are distinguished by
\(xx\) using the argument above. In the other sub-case, we have
\(u_1 u_2 \cdots u_{k-1} = v_1 v_2 \cdots v_{\ell-1} = w\). Then \(u\) and
\(v\) are distinguished by \(t = w u_k w\). Indeed,
\(ut = w u_k w u_k w \in L_{\mathrm{UNIQ}}\), while
\(vt = w v_\ell w u_k w\) can be decoded as \(v' = w v_\ell w u_k w\) or as
\(v'' = w u_k w v_\ell w\).

□
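
The case analysis in the proof translates into a procedure that produces a distinguishing suffix for any two distinct members of \(D_i\), and the claim can then be checked by brute force for small \(i\). In this sketch all function names are hypothetical, letters are single digits, and unique decodability is tested by exhaustive search over the bigram graph and not by any automaton.

```python
from collections import Counter
from itertools import combinations


def build_D(i):
    D = ["0"]
    for letter in range(1, i + 1):
        D = D + [w + str(letter) for w in D]
    return D


def is_unique(w):
    """Brute force: exactly one string has the bigram counts of $w$."""
    padded = "$" + w + "$"
    edges = Counter(zip(padded, padded[1:]))

    def walk(node, remaining):
        if remaining == 0:
            return 1 if node == "$" else 0
        total = 0
        for (a, b), c in list(edges.items()):
            if a == node and c > 0 and total < 2:
                edges[(a, b)] -= 1
                total += walk(b, remaining - 1)
                edges[(a, b)] += 1
        return total

    return walk("$", len(padded) - 1) == 1


def distinguishing_suffix(u, v):
    """Suffix t from the proof, for distinct strictly increasing u, v starting with 0."""
    top = max(u + v)                  # plays the role of the letter i + 1
    if v[-1] != top:
        u, v = v, u                   # arrange that v ends in the top letter
    if u[-1] == top:                  # case (ii): both strings end in the top letter
        x = min(set(u) ^ set(v))
        return x + x
    pu, pv = u[:-1], v[:-1]           # case (iii): only v ends in the top letter
    if pu != pv:
        x = min(set(pu) ^ set(pv))
        return x + x
    return pu + u[-1] + pu            # equal prefixes w: t = w u_k w


def all_distinguished(i):
    for u, v in combinations(build_D(i), 2):
        t = distinguishing_suffix(u, v)
        if is_unique(u + t) == is_unique(v + t):
            return False
    return True


print(distinguishing_suffix("02", "03"))  # 020
print(all_distinguished(3))               # True
```

Choosing the top letter as the largest letter of the pair means case (i) never has to be handled separately: the recursion in the proof is unrolled by looking at the pair at the smallest level that contains both strings.
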

6. Discussion 

We have provided a novel, constructive proof that \(L_{\mathrm{UNIQ}}\) is a
regular language, which yields as a by-product an \(O(|\Sigma|^3)\)-sized NFA
recognizing the complement of \(L_{\mathrm{UNIQ}}\) that can be efficiently
simulated. We have also shown that the minimum DFA has \(2^{f(|\Sigma|)}\)
states, where

\[ n - 1 \le f(n) \le C n \log n \]

for some universal constant \(C\). The exact growth rate of \(f(n)\) is an
intriguing open problem.